System and method for assessing quality of a singing voice

ABSTRACT

Disclosed is a system for assessing quality of a singing voice singing a song. The system comprises memory and at least one processor. The memory stores instructions that, when executed by the at least one processor, cause the at least one processor to receive a plurality of inputs comprising a first input and one or more further inputs, each input comprising a recording of a singing voice singing the song, to determine, for the first input, one or more relative measures of quality of the singing voice by comparing the first input to each further input; and to assess quality of the singing voice of the first input based on the one or more relative measures. Also disclosed is a method implemented on such a system.

CROSS-REFERENCE TO RELATED APPLICATIONS

The present application is a national phase entry under 35 U.S.C. § 371 of International Application No. PCT/SG2020/050457, filed Aug. 5, 2020, published in English, which claims priority from Singapore Patent Application Serial No. 10201907238Y, filed Aug. 5, 2019, both of which are incorporated herein by reference.

TECHNICAL FIELD

The present invention relates, in general terms, to a system for assessing quality of a singing voice singing a song, and a method implemented or instantiated by such a system. The present invention particularly relates to, but is not limited to, evaluation of singing quality without using a standard reference for that evaluation.

BACKGROUND

Singing has always been a popular medium of social recreation. Many amateur and aspiring singers desire to improve their singing ability. This is often done with reference to a baseline tune sung by an expert singer that the amateur or aspiring singer endeavours to emulate. Music experts evaluate singing quality with the help of their music knowledge and perceptual appeal.

Computer-assisted singing learning tools have been found to be useful for singing lessons. Recently, karaoke singing apps and online platforms have provided a platform for people to showcase their singing talent, and a convenient way for amateur singers to practice and learn singing. They also provide an online competitive platform for singers to connect with other singers all over the world and improve their singing skills. Automatic singing evaluation systems on such platforms typically compare a sample singing vocal with a standard reference such as a professional singing vocalisation or the song melody notes to obtain an evaluation score. For example, Perceptual Evaluation of Singing Quality (PESnQ) measures the similarity between a test singing (i.e. singing voice) and a reference singing in terms of pitch, rhythm, vibrato, etc. However, such methods are constrained either by the need for a professional-grade singer, or the availability of digital sheet music for every song, to establish a baseline tune or melody against which each singer's test singing can be compared. The aesthetic perception of singing quality is very subjective and varies between evaluators. As a result, even experts often disagree on the perfection of a certain performance. The choice of an ideal or gold-standard reference singer brings in a bias of subjective choice.

Aspiring singers upload cover versions of their favourite songs to online platforms, where they are listened to and liked by millions across the globe. However, discovering talented singers from such huge datasets is a challenging task. Moreover, oftentimes the cover songs don't follow the original music scores, but rather demonstrate the creativity and singing style of individual singers. In such cases, reference singing or musical score-based evaluation methods are less than ideal for singing evaluation.

There have been a few studies on evaluating singing quality without a standard reference. However, these studies typically focus on a single measure to infer singing quality and disregard other characteristics of singing. Using pitch as an example, if a singer sings only one note throughout the song, pitch interval accuracy will classify it as good singing. Such a measure therefore fails fundamentally: it overlooks the occurrence of several notes in a song and the fact that different notes are sustained for different durations.

In addition, many such methods still require a reference melody to determine whether note locations (i.e. timing) are correct.

It would be desirable to overcome or ameliorate at least one of the above-described problems with prior art singing quality evaluation schema, or at least to provide a useful alternative.

SUMMARY

Automatic evaluation of singing quality can be done with the help of a reference singing or the digital sheet music of the song. However, such a standard reference is not always available. Described herein is a framework to rank a large pool of singers according to their singing quality without any standard reference. In various embodiments, this ranking methodology involves identifying musically motivated absolute measures (i.e. of singing quality) based on a pitch histogram, and relative measures based on inter-singer statistics to evaluate the quality of singing attributes such as intonation and rhythm.

The absolute measures evaluate how good a pitch histogram is for a specific singer, while the relative measures use the similarity between singers in terms of pitch, rhythm, and timbre as an indicator of singing quality. Thus, embodiments described herein combine absolute measures and relative measures in the assessment of singing quality, the corollary of which is then to rank singers amongst each other. With the relative measures, the concept of veracity or truth-finding is formulated for ranking of singing quality. A self-organizing approach to rank-ordering a large pool of singers based on these measures has been validated as set out below. The fusion of absolute and relative measures results in an average Spearman's rank correlation of 0.71 with human judgments in a 10-fold cross validation experiment, which is close to the inter-judge correlation.

Humans are known to be better at relative judgments, i.e. choosing the best and the worst among a small set of singers, than they are at producing an absolute rating. As a result, the present disclosure explores and validates the idea of automatically generating a leader board of singers, where the singers are rank-ordered according to their singing quality relative to each other. With the immense number of online uploads on singing platforms, it is now possible with the present teachings to leverage comparative statistics between singers as well as music theory to derive such a leader board of singers.

Embodiments of the systems and methods disclosed herein can rank and evaluate singing vocals of many different singers singing the same song, without needing a reference template singer or a gold-standard. The present algorithm, when combined with the other features of the method with which it interacts, will be useful as a screening tool for online and offline singing competitions. Embodiments of the algorithm can also provide feedback on the overall singing quality as well as on underlying parameters such as pitch, rhythm, and timbre, and can therefore serve as an aid to the process of learning how to sing better, i.e. a singing teaching tool.

Disclosed herein is a system for assessing quality of a singing voice singing a song, comprising:

memory; and

at least one processor, wherein the memory stores instructions that, when executed by the at least one processor, cause the at least one processor to:

-   receive a plurality of inputs comprising a first input and one or more further inputs, each input comprising a recording of a singing voice singing the song;
-   determine, for the first input, one or more relative measures of quality of the singing voice by comparing the first input to each further input; and
-   assess quality of the singing voice of the first input based on the one or more relative measures.

The at least one processor may determine one or more relative measures by assessing a similarity between the first input and each further input. The at least one processor may assess a similarity between the first input and each further input by, for each relative measure, assessing one or more of a similarity of pitch, rhythm and timbre. The at least one processor may assess the similarity of pitch, rhythm and timbre as being inversely proportional to a pitch-based relative distance, rhythm-based relative distance and timbre-based relative distance respectively of the singing voice of the first input relative to the singing voice of each further input. For a second input comprising a recording of a singing voice singing the song, the at least one processor may determine the singing voice of the first input to be higher quality than the singing voice of the second input if the similarity between the first input and each further input is greater than a similarity between the second input and each further input.

The instructions may further cause the at least one processor to determine, for the first input, one or more absolute measures of quality of the singing voice, and assess quality of the singing voice based on the one or more relative measures and the one or more absolute measures. Each absolute measure of the one or more absolute measures may be an assessment of one or more of pitch, rhythm and timbre of the singing voice of the first input. At least one said absolute measure may be an assessment of pitch based on one or more of overall pitch distribution, pitch concentration and clustering on musical notes. The at least one processor may assess pitch by producing a pitch histogram, and assess a singing voice as being of higher quality as peaks in the pitch histogram become sharper.

The instructions may further cause the at least one processor to rank the quality of the singing voice of the first input against the quality of the singing voice of each further input.

Also disclosed herein is a method for assessing quality of a singing voice singing a song, comprising:

receiving a plurality of inputs comprising a first input and one or more further inputs, each input comprising a recording of a singing voice singing the song;

determining, for the first input, one or more relative measures of quality of the singing voice by comparing the first input to each further input; and

assessing quality of the singing voice of the first input based on the one or more relative measures.

Determining one or more relative measures may comprise assessing a similarity between the first input and each further input. Assessing a similarity between the first input and each further input may comprise, for each relative measure, assessing one or more of a similarity of pitch, rhythm and timbre. The similarity of pitch, rhythm and timbre may be assessed as being inversely proportional to a pitch-based relative distance, rhythm-based relative distance and timbre-based relative distance respectively of the singing voice of the first input relative to the singing voice of each further input.

For a second input comprising a recording of a singing voice singing the song, the singing voice of the first input may be determined to be higher quality than the singing voice of the second input if the similarity between the first input and each further input is greater than a similarity between the second input and each further input.

The method may further comprise determining, for the first input, one or more absolute measures of quality of the singing voice, and assessing quality of the singing voice based on the one or more relative measures and the one or more absolute measures. Each absolute measure of the one or more absolute measures may be an assessment of one or more of pitch, rhythm and timbre of the singing voice of the first input. At least one said absolute measure may be an assessment of pitch based on one or more of overall pitch distribution, pitch concentration and clustering on musical notes. Assessing pitch may involve producing a pitch histogram, with a singing voice being assessed as being of higher quality as peaks in the pitch histogram become sharper.

The method may further comprise ranking the quality of the singing voice of the first input against the quality of the singing voice of each further input.

Presently, there is no available method for reference-independent rank-ordering of singers. Advantageously, embodiments of the system and method described herein enable automatic rank ordering of singers without relying on a reference singing rendition or melody. As a result, automatic singing quality evaluation is not constrained by the need for a reference template (e.g. baseline melody or expert vocal rendition) for each song against which a singer is being evaluated.

Similarly, there is presently no available tool that provides research-validated feedback on underlying musical parameters for singing quality evaluation. Embodiments of the algorithm described herein, when used in conjunction with other features described herein, can serve as an aid to singing teaching that provides feedback on overall singing quality as well as on underlying parameters such as pitch, rhythm, and timbre.

Advantageously, embodiments of the present invention provide evaluation of singing quality based on the musically-motivated absolute measures that quantify various singing quality discerning properties of a pitch histogram. Consequently, the singer may be creative and not copy the reference or baseline melody exactly, and yet sound good and be evaluated as such. Accordingly, such an evaluation of singing quality helps avoid penalising singers for creativity and captures the inherent properties of singing quality.

Advantageously, embodiments provide singing quality evaluation based on truth-pattern-finding-based, musically-informed relative measures of singing quality that leverage inter-singer statistics. This provides a self-organising, data-driven way of rank-ordering singers that avoids relying on a reference or template (e.g. a baseline melody).

Advantageously, embodiments of the present invention enable evaluation of underlying parameters such as pitch, rhythm and timbre without relying on a reference. Experimental evidence discussed herein indicates that machines can provide a more robust and unbiased assessment of the underlying parameters of singing quality when compared with a human assessment.

BRIEF DESCRIPTION OF THE DRAWINGS

Embodiments of the present invention will now be described, by way of non-limiting example, with reference to the drawings in which:

FIG. 1 provides a method in accordance with present teachings, for assessing singing quality;

FIG. 2 provides a schematic diagram of a system for performing the method of FIG. 1;

FIG. 3 is a normalized pitch histogram for (a) MIDI, and GMM-fit and detected peaks (dots) on normalized pitch histogram for (b) good singing (c) poor singing of the song “I have a dream” by ABBA (1 bin=10 cents);

FIG. 4 is a normalized pitch histogram (1 bin=10 cents) (top), autocorrelation of the histogram (middle), and the magnitude of the Fourier transform of the autocorrelation (bottom) for (a) good singing (b) poor singing;

FIG. 5 is a visualization of the pitch-based relative measure distance metric pitch_med_dist between each singer and the remaining 99 singers, for the best 3 (top row) and the worst 3 (bottom row) singers among 100 singers singing the song “Let it go”;

FIG. 6 demonstrates relative scoring methods from the pitch_med_dist measure for the best (Rank 1) and the worst (Rank 100) singer of an example song (Song 1, snippet 1), along with the respective relative measure values or scores using: (a) Method 1: Affinity by Headcount (b) Method 2: Affinity by kth Nearest Distance, k=10 (c) Method 3: Affinity by Median Distance. The circles in (a) and (b) are the thresholds, while in (c) the circle is the median value;

FIG. 7 is an overview of the framework for automatic singing quality leader board generation, consisting of a fusion of a musically-motivated absolute scoring system and an inter-singer distance based scoring system;

FIG. 8 is the Spearman's rank correlation performance of three methods for inter-singer distance measurement (Singer characterisation using inter-singer distance): Method 1: Affinity by Headcount; Method 2: Affinity by 10th Nearest Distance; Method 3: Affinity by Median Distance;

FIG. 9 shows the Spearman's rank correlation of the individual absolute measures (top) and relative measures (bottom) with human BWS ranks; and

FIG. 10 shows the Humans vs. Machines experimental outcomes: correlation between scores given individually for pitch, rhythm, and timbre by (a) human experts, (b) machine on the same data as in (a), and (c) machine, on the data used in this work, as reflected in Table III.

DETAILED DESCRIPTION

It has been determined that music experts can evaluate singing quality with high consensus when the melody or the song is unknown to them. This suggests that there are inherent properties of singing quality that are independent of a reference singer or melody, which help the music experts judge singing quality without a reference. The present disclosure explores these properties and proposes methods to automatically evaluate singing quality without depending on a reference, and systems that implement such methods.

The teachings of the present disclosure are extended to cover the discovery of good or quality singers from a large number of singers by assessing the similarities or the relative distances between singers. Based on the concept of veracity, it is postulated that good singers sing alike or similarly and bad singers sing very differently from each other. Consequently, if all singers sing the same song, the good singers will share many characteristics such as the frequently hit notes, the sequence of notes and the overall consistency in the rhythm of the song. Conversely, different poor singers will deviate from the intended song in different ways. For example, one poor singer may be out of tune at certain notes while another may be out of tune at other notes. As a result, relative measures based on inter-singer distance can serve as an indicator of singing quality.

Embodiments of the methods and systems described herein provide a framework to combine pitch histogram-based measures with the inter-singer distance measures to provide a comprehensive singing quality assessment without relying on a standard reference. We assess the performance of our algorithm by comparing against human judgments.

In the context of singing pedagogy, detailed feedback to a learner about their performance with respect to the individual underlying perceptual parameters such as pitch, rhythm, and timbre, is important. Although humans are known to provide consistent overall judgments, they are not good at objectively judging the quality of individual underlying parameters. As such, singing quality evaluation schema described herein outperform human judges in this regard.

Such a method for assessing quality of the singing voice singing a song is described with reference to the steps shown in FIG. 1. The method 100 broadly comprises:

Step 102: receiving a plurality of inputs. The inputs comprise a first input and one or more further inputs. Each input comprises a recording of a singing voice singing the song. The first input is the recording of the singing voice for which the assessment is being made. Each further input is a recording of a singing voice against which the first input is being assessed, which may be the singing voice of another singer or another recording made by the same singer as the one who recorded the first input.

Step 104: determining, for the first input, one or more relative measures of quality of the singing voice. As will be discussed in greater detail below, this is performed by comparing the first input to each further input.

Step 106: assessing quality of the singing voice of the first input based on the one or more relative measures.
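Steps 102 to 106 can be illustrated with a minimal sketch, assuming that a pairwise distance between every two recordings has already been computed. The sketch scores each singer by the median of their distances to all other singers, in the spirit of the "Affinity by Median Distance" method named in the description of FIG. 6; the function names and the toy distance matrix are hypothetical, and the distance itself (pitch-, rhythm- or timbre-based) is left unspecified here.

```python
from statistics import median

def median_distance_scores(dist):
    """Return, for each singer, the median distance to every other singer.

    `dist` is assumed to be a square, symmetric matrix (list of lists) in
    which dist[i][j] is some inter-singer distance; how that distance is
    computed is outside the scope of this sketch.
    """
    n = len(dist)
    scores = []
    for i in range(n):
        others = [dist[i][j] for j in range(n) if j != i]
        scores.append(median(others))
    return scores

def rank_singers(dist):
    """Rank singer indices from best to worst.

    A smaller median distance means the singer is more similar to the
    rest of the pool, which the veracity postulate treats as better.
    """
    scores = median_distance_scores(dist)
    return sorted(range(len(scores)), key=lambda i: scores[i])

# Toy example: singers 0-2 sing alike, singer 3 deviates from all of them.
toy_dist = [
    [0, 1, 1, 5],
    [1, 0, 1, 6],
    [1, 1, 0, 7],
    [5, 6, 7, 0],
]
toy_scores = median_distance_scores(toy_dist)
toy_ranking = rank_singers(toy_dist)
```

In this toy pool the deviating singer ends up last in `toy_ranking`, because their median distance to the others is the largest.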

The method 100 may be executed in a computer system such as that shown in FIG. 2. As set out briefly below, the computer system is for assessing quality of the singing voices singing a song, and will comprise memory and at least one processor, the memory storing instructions that when executed by the at least one processor will cause the computer system to perform method 100.

Various embodiments of method 100 make the following major contributions, each of which is discussed in greater detail below. Firstly, embodiments of the method 100 use novel inter-singer relative measures based on the concept of veracity, which enable rank-ordering of a large number of singing renditions without relying on reference singing. Secondly, embodiments of the method 100 use a combination of absolute and relative measures to characterise the inherent properties of singing quality, e.g. those that might be picked up by a human assessor but not by known machine-based assessors.

The method 100 may be employed, for example, on a computer system 200 as shown in the block diagram of FIG. 2. The computer system 200 will typically be a desktop computer or laptop. However, the computer system 200 may instead be a mobile computer device such as a smart phone, a personal data assistant (PDA), a palm-top computer, or multimedia Internet enabled cellular telephone.

As shown, the computer system 200 includes the following components in electronic communication via a bus 212:

-   (a) relative measures module 202;
-   (b) absolute measures module 204;
-   (c) ranking module 206;
-   (d) a display 208;
-   (e) non-volatile (non-transitory) memory 210;
-   (f) random access memory (“RAM”) 214;
-   (g) N processing components embodied in processor module 216;
-   (h) a transceiver component 218 that includes N transceivers; and
-   (i) user controls 220.

Although the components depicted in FIG. 2 represent physical components, FIG. 2 is not intended to be a hardware diagram. Thus, many of the components depicted in FIG. 2 may be realized by common constructs or distributed among additional physical components. Moreover, it is certainly contemplated that other existing and yet-to-be developed physical components and architectures may be utilized to implement the functional components described with reference to FIG. 2.

The three main subsystems, the operation of which is described herein in detail, are the relative measures module 202, the absolute measures module 204 and the ranking module 206. The various measures calculated by modules 202 and 204, and/or the ranking determined by module 206, may be displayed on display 208. The display 208 may be realized by any of a variety of displays (e.g., CRT, LCD, HDMI, micro-projector and OLED displays).

In general, the non-volatile data storage 210 (also referred to as non-volatile memory) functions to store (e.g., persistently store) data and executable code, such as the instructions necessary for the computer system 200 to perform the method 100, including the various computational steps required to achieve the functions of modules 202, 204 and 206. The executable code in this instance thus comprises instructions enabling the system 200 to perform the methods disclosed herein, such as that described with reference to FIG. 1.

In some embodiments, for example, the non-volatile memory 210 includes bootloader code, modem software, operating system code, file system code, and code to facilitate the implementation of components that are well known to those of ordinary skill in the art and that, for simplicity, are neither depicted nor described.

In many implementations, the non-volatile memory 210 is realized by flash memory (e.g., NAND or ONENAND memory), but it is certainly contemplated that other memory types may be utilized as well. Although it may be possible to execute the code from the non-volatile memory 210, the executable code in the non-volatile memory 210 is typically loaded into RAM 214 and executed by one or more of the N processing components 216.

The N processing components 216 in connection with RAM 214 generally operate to execute the instructions stored in non-volatile memory 210. As one of ordinary skill in the art will appreciate, the N processing components 216 may include a video processor, modem processor, DSP, graphics processing unit, and other processing components. The N processing components 216 may form a central processing unit (CPU), which executes operations in series. In some embodiments, it may be desirable to use a graphics processing unit (GPU) to increase the speed of analysis and thereby enable, for example, the real-time assessment of singing quality, e.g. during performance of the song. Whereas a CPU would need to perform the actions using serial processing, a GPU can provide multiple processing threads to identify features/measures or compare singing inputs in parallel.

The transceiver component 218 includes N transceiver chains, which may be used for communicating with external devices via wireless networks, microphones, servers, memory devices and others. Each of the N transceiver chains may represent a transceiver associated with a particular communication scheme. For example, each transceiver may correspond to protocols that are specific to local area networks, cellular networks (e.g., a CDMA network, a GPRS network, a UMTS network), and other types of communication networks.

Reference numeral 224 indicates that the computer system 200 may include physical buttons, as well as virtual buttons such as those that would be displayed on display 208. Moreover, the computer system 200 may communicate with other computer systems or data sources over network 226.

It should be recognized that FIG. 2 is merely exemplary and that the functions described herein may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions may be stored on, or transmitted as, one or more instructions or code encoded on a non-transitory computer-readable medium 210. Non-transitory computer-readable medium 210 includes both computer storage medium and communication medium including any medium that facilitates transfer of a computer program from one place to another. A storage medium may be any available medium that can be accessed by a computer, such as a USB drive, solid state hard drive or hard disk.

To provide versatility, it may be desirable to implement the method 100 in the form of an app, or use an app to interface with a server on which the method 100 is executed. These functions and any other desired functions may be achieved using apps 222, which can be installed on a mobile device. The apps 222 may also enable singers using separate devices to compete in a singing competition evaluated using the method 100, e.g. to see who achieves the highest ranking whether at the end of a song or in real time during performance of the song.

Musically-Motivated Measures

Some studies have found that human judges can evaluate singers with high consistency even when the songs are unknown to the judges. This finding suggests that singing quality judgment depends more on common, objective features than on subjective preference. Moreover, experts make their judgment relying neither on their memory of the song nor on a reference melody.

Subjective assessment studies suggest that the most important properties for singing quality evaluation are pitch and rhythm. To enable an automated assessment to be performed, the method 100 further includes step 108, for determining absolute measures of quality (the pitch being one such absolute measure), and the memory 210 similarly includes instructions to cause the N processing units 216 to determine, using the absolute measures module 204, one or more absolute measures of quality of the singing voice of the first input (i.e. the input being assessed). Quality of the singing voice can then be assessed based on one or more relative measures discussed below, and one or more absolute measures such as the pitch, rhythm and timbre.

Considering pitch firstly, pitch is an auditory sensation in which a listener assigns musical tones to relative positions on a musical scale based primarily on their perception of the frequency of vibration. Pitch is characterized by the fundamental frequency F0 and its movements between high and low values. Musical notes are the musical symbols that indicate the pitch values, as well as the location and duration of pitch, i.e. the timing information or the rhythm of singing. In karaoke singing, visual cues to the lyric lines to be sung are provided to help the singer have control over the rhythm of the song. Therefore, in the context of karaoke singing, rhythm is not expected to be a major contributor to singing quality assessment. Pitch, however, can be perceived and computed. Therefore, characterization of singing pitch is a focus of the system 200. The particular qualities sought to be extracted from the inputs can include one or more of the overall pitch distribution of a singing voice, the pitch concentration and clustering on musical notes. To perform this extraction, pitch histograms can be useful.

A. Pitch Histogram

Pitch histograms are global statistical representations of the pitch content of a musical piece. They represent the distribution of pitch values in a sung rendition. A pitch histogram is computed as the count of the pitch values folded on to the 12 semitones in an octave. To enable an analysis, the methods disclosed herein may calculate pitch values in the unit of cents (one semitone being 100 cents on an equi-tempered octave). That calculation may be performed according to:

$\begin{matrix}{f_{cent} = {1200 \times \log_{2}\frac{f_{Hz}}{440}}} & (1)\end{matrix}$

where 440 Hz (pitch-standard musical note A4) is considered as the base frequency. Presently, pitch estimates are produced by known auto-correlation based pitch estimators. Thereafter, a generic post-processing step is used to remove frames with low periodicity.
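As a minimal sketch of the Hz-to-cents conversion, assuming the standard definition of the cent (1200 cents per octave, so that one semitone is 100 cents, as stated above), a helper might look as follows. The function name `hz_to_cents` and the constant `A4_HZ` are illustrative and do not appear in the disclosure.

```python
import math

A4_HZ = 440.0  # base frequency: pitch-standard musical note A4

def hz_to_cents(f_hz, ref_hz=A4_HZ):
    """Convert a frequency in Hz to cents relative to `ref_hz`.

    Uses 1200 cents per octave, so one equi-tempered semitone is
    100 cents. Non-positive frequencies (e.g. unvoiced frames) have
    no defined pitch and are rejected.
    """
    if f_hz <= 0:
        raise ValueError("frequency must be positive")
    return 1200.0 * math.log2(f_hz / ref_hz)

# One octave above A4, and one semitone above A4 (A#4).
octave_up = hz_to_cents(880.0)
semitone_up = hz_to_cents(440.0 * 2 ** (1 / 12))
```

Under these assumptions `octave_up` evaluates to 1200 cents and `semitone_up` to approximately 100 cents.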

Computing the pitch histogram may comprise removing the key of the song. A number of steps may be performed here. This can involve converting pitch values to an equi-tempered scale (cents). This may also involve subtracting the median from the pitch values. Since the median does not represent the tuning frequency of a singer, the pitch histogram obtained this way may show some shift across singers. However, this does not affect the strength of the peaks and valleys in the histogram. Also, the data used to validate this calculation was taken from karaoke, where the singers sang along with the background track of the song; accordingly, the key is supposed to remain the same across singers (i.e. it cannot be used as a benchmark).

The median of pitch values in a singing rendition is subtracted. All pitch values are transposed to a single octave, i.e. within −600 to +600 cents. The pitch histogram H is then calculated by placing the pitch values into corresponding bins (i.e. subranges in the single octave into which all pitch values are transposed):

$\begin{matrix}{H_{k} = {\sum\limits_{n = 1}^{N}m_{k}}} & (2)\end{matrix}$

where H_(k) is the k^(th) bin count, N is the number of pitch values, m_(k)=1 if c_(k)<P(n)<c_(k+1) and m_(k)=0 otherwise, where P(n) is the n^(th) pitch value in an array of pitch values and (c_(k), c_(k+1)) are the bounds on the k^(th) bin in cents in the octave to which all the pitch values are transposed. To obtain a fine histogram representation, each semitone was divided into 10 bins. Thus, 12 semitones×10 bins each=120 bins in total, each representing 10 cents. It will be appreciated that a different number of bins may be used and/or each bin may represent a number of cents other than 10.
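The histogram computation described above (median subtraction, transposition into a single octave, and counting into 120 bins of 10 cents) can be sketched as follows. This is an illustrative sketch rather than the disclosed implementation: the function name and parameters are hypothetical, the input is assumed to be a list of pitch values already expressed in cents, and half-open bins are used so that every value falls into exactly one bin.

```python
from statistics import median

def pitch_histogram(cents, bins_per_semitone=10, normalize=False):
    """Fold pitch values (in cents) into one octave and count them per bin.

    Steps: subtract the median pitch, transpose into [-600, +600) cents,
    then count values in 12 * bins_per_semitone equal-width bins.
    Bins are half-open [c_k, c_(k+1)), an assumption of this sketch.
    """
    n_bins = 12 * bins_per_semitone
    bin_width = 1200.0 / n_bins
    med = median(cents)
    hist = [0] * n_bins
    for c in cents:
        # Shift by +600 so the folded octave [-600, 600) maps to [0, 1200).
        folded = (c - med + 600.0) % 1200.0
        # Guard against a floating-point result equal to 1200.0.
        k = min(int(folded // bin_width), n_bins - 1)
        hist[k] += 1
    if normalize:
        total = sum(hist)
        return [h / total for h in hist]
    return hist

# Toy input: pitch values that are octave-equivalent to the median land
# in the same bin; a value 700 cents above the median folds to -500 cents.
toy_cents = [0, 0, 0, 1200, -1200, 700]
toy_hist = pitch_histogram(toy_cents)
```

With the default of 10 bins per semitone the result has 120 bins; passing `normalize=True` scales the counts so that they sum to 1.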

The melody of a song typically consists of a set of dominant musical notes (or pitch values). These are the notes that are hit frequently in the song and sometimes are sustained for long duration. These dominant notes are a subset of the 12 semitones present in an octave. The other semitones may also be sung during the transitions between the dominant notes, but are comparatively less frequent and not sustained for long durations. Thus, in the pitch histogram of a good singing vocal of a song, these dominant notes should appear as the peaks, while the transition semitones appear in the valley regions.

FIG. 3 shows the pitch histogram of a MIDI (Musical Instrument Digital Interface) signal (FIG. 3a), the pitch histogram of a good singing vocal or vocalisation (FIG. 3b), and that of a poor singing vocal or vocalisation (FIG. 3c), all performing the same song. The area of each histogram is normalized to 1. The MIDI version contains the notes of the original composition, and therefore represents the canonical pitch histogram of the song. It is apparent that the good singer histogram should be close to the MIDI histogram. The MIDI histogram has four sharp peaks showing that those pitch values are frequently and consistently hit, more than the rest of the pitch values. Since, generally, a song consists of only a set of dominant notes, the sharp, narrow, and well-defined spikes/peaks of the good singer's pitch histogram indicate that the notes of the song are being hit repeatedly and consistently, in a similar manner to the MIDI histogram. On the other hand, the poor singer has a dispersed distribution of pitch values, reflecting that the singer is unable to hit the dominant notes of the song consistently. Therefore, a singing voice may be assessed as being of higher quality as peaks in the pitch histogram become sharper.

Some statistical measures, kurtosis and skew, were used to measure the sharpness of the pitch histogram. These are overall statistical indicators that do not place much emphasis on the actual shape of the histogram, which could be informative about the singing quality. Therefore, for present purposes, the musical properties of singing quality are characterised with the 12-semitone pitch histogram. It is expected that the shape of this histogram, for example the number of peaks, the height and spread of the peaks, and the intervals between the peaks, contains vital information about how well the melody is sung. Therefore, assessing the singing voice may involve determining one or more of the number of peaks in the histogram, the height of the peaks, the spread (or sharpness) of the peaks and/or the intervals between the peaks. Although the correctness or accuracy of the notes being sung cannot be directly determined when the notes of the song are not available, the consistency of the pitch values being hit, which is an indicator of the singing quality, can still be measured.

B. Pitch Assessment from the Perspective of Overall Pitch Distribution

Overall pitch distribution is a group of global statistical measures that compute the deviation of the pitch distribution from a normal distribution. As seen in FIG. 3, the pitch histograms of good singers show multiple sharp peaks, while those of poor singers show a dispersed distribution of pitch values. Therefore, the histogram of a poor singer will be closer to a normal distribution than that of a good singer. Accordingly, assessing the quality of the singing voice of the first input may involve analysing the overall pitch distribution.

1) Kurtosis: Kurtosis is a statistical measure (fourth standardized moment) of whether the data is heavy tailed or light tailed relative to a normal distribution, defined as:

$\begin{matrix}{{Kurt} = {E\left\lbrack \left( \frac{\overset{\rightarrow}{x} - \mu}{\sigma} \right)^{4} \right\rbrack}} & (3)\end{matrix}$

where $\overset{\rightarrow}{x}$ is the data vector, which in the present case is the pitch values over time, μ is the mean and σ is the standard deviation of $\overset{\rightarrow}{x}$.

A good singer's pitch histogram is expected to have several sharp spikes, as shown in FIG. 3b. Therefore, a good singer's pitch histogram should not reflect a normal distribution. A corollary of this is that a good singer would have a higher kurtosis value than a poor singer. Accordingly, assessing the quality of the singing voice of the first input may involve assessing kurtosis, where a higher kurtosis is indicative of better quality singing.

2) Skew: Skew is a measure of the asymmetry of a distribution with respect to the mean, defined as:

$\begin{matrix}{{Skew} = {E\left\lbrack \left( \frac{\overset{\rightarrow}{x} - \mu}{\sigma} \right)^{3} \right\rbrack}} & (4)\end{matrix}$

where $\overset{\rightarrow}{x}$ is the data vector, μ is the mean and σ is the standard deviation of $\overset{\rightarrow}{x}$.

The pitch histogram of a good singer has peaks around the notes of the song, whereas that of a poor singer is expected to be more dispersed and spread out relatively symmetrically. So, the pitch histogram of a poor singer is expected to be closer to a normal distribution (FIG. 3c), or more symmetrical. Accordingly, assessing the quality of the singing voice of the first input may involve assessing skew, where higher asymmetry as reflected by the skew value is indicative of better quality singing.
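Equations (3) and (4) are both standardized moments of the pitch values and may be computed with one helper, as in the following sketch. The function names are hypothetical, and the use of the population standard deviation (rather than the sample standard deviation) is an assumption of this example.

```python
from statistics import fmean, pstdev

def standardized_moment(x, order):
    """E[((x - mu) / sigma) ** order] over the pitch values x."""
    mu = fmean(x)
    sigma = pstdev(x)  # population standard deviation (an assumption of this sketch)
    return fmean([((v - mu) / sigma) ** order for v in x])

def kurt(x):
    # Equation (3): fourth standardized moment.
    return standardized_moment(x, 4)

def skew(x):
    # Equation (4): third standardized moment.
    return standardized_moment(x, 3)
```

For example, `skew([0, 0, 0, 1])` is positive because the single large value lengthens the right tail, whereas a symmetric set of values gives a skew of zero.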

C. From the Perspective of Pitch Concentration

The previous group of measures considered the overall distribution of the pitch values with respect to a normal distribution. However, those measures do not reference whether the singing vocal hits the musical notes. For example, a consistent, incorrect note may be sung that leads to a very distinct peak in a histogram. It would therefore be useful to quantify the precision with which the notes are being hit.

One method as taught herein for assessing singing quality involves measuring the concentration of the pitch values in the pitch histogram. Multiple sharp peaks in the histogram indicate precision in hitting the notes. Moreover, the intervals between these peaks contain information about the relative location of these notes in the song, indicating the musical scale in which the song was sung.

1) Gaussian mixture model-fit (GMM-fit): To capture the fine details of the histogram, a mixture of Gaussian distributions is used to model the pitch histogram. FIGS. 3b and 3c show the GMM-fit for a good and a poor singer respectively. After experimenting with the number of mixtures, it was found that good singers require a higher number of mixtures because they produce many concentrated, sharp peaks. Empirically, the number of mixtures was set to 150, though any suitable number may be used as appropriate. To characterise the peaks in the histogram, the local maxima in the GMM-fit are detected. A point is considered to be a good candidate peak if it is preceded and succeeded by a lower value. Also, empirically, a good candidate is found if it is the highest peak within ±50 cents. The methods as taught herein may then characterise singing quality on the basis of the detected peaks. The methods may perform this characterisation in one or both of the following two ways.

Firstly, the method may measure the spread around the peak, that spread indicating the consistency with which a particular note is hit. This spread is referred to herein as the Peak Bandwidth (PeakBW), which may be defined as:

$\begin{matrix}{{PeakBW} = {\frac{1}{N^{2}}{\sum_{i = 1}^{N}w_{i}^{2}}}} & (5)\end{matrix}$

where w_(i) is the 3 dB (half-power) width of the i^(th) detected peak.

In embodiments where the first input and further input relate to a pop song, such a song can be expected to have more than one or two significant peaks. Therefore, an additional penalty is applied if there is only a small number of peaks, by dividing by the number of peaks N. The PeakBW measure, which is already averaged over the number of peaks, therefore becomes inversely proportional to N².
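As a worked illustration of equation (5), the sketch below computes PeakBW from a list of peak widths that is assumed to have been measured already (for example from the GMM-fit). The function name `peak_bw` and the choice of cents as the width unit are assumptions of this example.

```python
def peak_bw(widths):
    """Equation (5): sum of squared 3 dB peak widths divided by N squared."""
    n = len(widths)
    return sum(w * w for w in widths) / (n * n)

# Four peaks of 20 cents each versus a single peak of the same width:
many_peaks = peak_bw([20.0, 20.0, 20.0, 20.0])   # 4 * 400 / 16 = 100.0
single_peak = peak_bw([20.0])                    # 400 / 1 = 400.0
```

With identical widths, the rendition with only one detected peak receives the larger value, which reflects the additional penalty described above for a small number of peaks.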

Secondly, the method may involve measuring the percentage of pitch values around the peaks. This is referred to herein as the Peak Concentration (PeakConc) measure, and may be defined as:

$\begin{matrix}{{PeakConc} = \frac{\sum_{j = 1}^{N}{\sum_{i = {{bin}_{j} - \Delta}}^{{bin}_{j} + \Delta}A_{i}}}{\sum_{k = 1}^{M}A_{k}}} & (6)\end{matrix}$

where N is the number of peaks, bin_(j) is the bin number of the j^(th) peak, A_(i) is the histogram value of the i^(th) bin, and M is the total number of bins (120 in the present example, each representing 10 cents). Human perception is known to be sensitive to pitch changes, but the smallest perceptible change is debatable. There is general agreement among scientists that average adults are able to recognise pitch differences as small as 25 cents reliably. Thus, in equation (6), Δ is the number of bins on either side of the peak being considered, for measuring peak concentration. Δ represents the allowable range of pitch change in the relevant input without that input being perceived as out-of-tune. Next, empirical consideration is given to Δ values of ±5 and ±2 bins, i.e. ±50 cents and ±20 cents respectively, which along with the centre bin (10 cents), result in a total of 110 cents and 50 cents, respectively. These measures are referred to as PeakConc₁₁₀ and PeakConc₅₀ respectively.
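The peak selection rule and equation (6) can be sketched as below. This is a simplified illustration under stated assumptions: peaks are detected directly on the 120-bin histogram rather than on a GMM-fit, the histogram is treated as circular because all pitch values are folded into one octave, and the function names are hypothetical. As in equation (6) read literally, bins lying within Δ of two different peaks are counted once per peak.

```python
def detect_peaks(hist, neighbourhood=5):
    """Bins that exceed both neighbours and are the highest within +/- neighbourhood bins."""
    m = len(hist)
    peaks = []
    for k in range(m):
        # A candidate must be preceded and succeeded by a lower value.
        if hist[k] <= hist[(k - 1) % m] or hist[k] <= hist[(k + 1) % m]:
            continue
        # It must also be the highest point within +/- 50 cents (5 bins of 10 cents).
        window = [hist[(k + d) % m] for d in range(-neighbourhood, neighbourhood + 1)]
        if hist[k] >= max(window):
            peaks.append(k)
    return peaks

def peak_conc(hist, peaks, delta):
    """Equation (6): histogram mass within +/- delta bins of each peak over the total mass."""
    m = len(hist)
    near = sum(hist[(p + d) % m] for p in peaks for d in range(-delta, delta + 1))
    return near / sum(hist)
```

Under these assumptions, `peak_conc(hist, peaks, 5)` corresponds to PeakConc₁₁₀ and `peak_conc(hist, peaks, 2)` to PeakConc₅₀.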

2) Autocorrelation: Singers are supposed to sing mostly around the 12 semitones. The minimum interval is one semitone, and the intervals between the musical notes should be one or multiples of a semitone, which can be observed by performing autocorrelation on the pitch histogram for the singer. If a good singer hits the correct notes all the time, sharp peaks are expected at multiples of semitones in the Fourier transform of the autocorrelation of the pitch histogram. This is evident from FIG. 4 (bottom tier, FFT graph), where the magnitude spectrum of the autocorrelation of a good singing pitch histogram has energy in the higher frequencies, representing the interval pattern of the strengthened peaks in the pitch histogram. In contrast, that of the poor singing sample only has a zero frequency component.

The present method may involve computing the autocorrelation energy ratio measure, referred to herein as Autocorr, as the ratio of the energy in the higher frequencies to the total energy in the Fourier transform of the autocorrelation of the histogram. Autocorr may be defined as:

$\begin{matrix}{{Autocorr} = \frac{\sum_{f = {4Hz}}{|{Y(f)}|}^{2}}{\sum_{f = {0Hz}}{|{Y(f)}|}^{2}}} & (7)\end{matrix}$

where

$\begin{matrix}{{Y(f)} = {F\left( {\sum_{n = 1}^{120}{{y(n)}{y^{*}\left( {n - l} \right)}}} \right)}} & (8)\end{matrix}$

i.e. the Fourier transform of the autocorrelation of the histogram y(n), where n is the bin number, the total number of bins is 120, and l is the lag. The lower cut-off frequency of 4 Hz in the numerator of equation (7) corresponds to the assumption that at least 4 dominant notes are expected in a good singing rendition, i.e. 4 cycles per second. When used in the methods disclosed herein, the number of expected dominant notes may be fewer than 4 or greater than 4 as required for the particular type of music and/or particular application.
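Equations (7) and (8) can be illustrated with the following sketch. Several points are assumptions of the example rather than statements of the disclosure: the autocorrelation is taken as circular over the octave-folded histogram, the spectrum is evaluated one-sidedly up to half the number of bins, the discrete frequency index is used in place of the "Hz" axis of equation (7), and the function name `autocorr_ratio` is hypothetical.

```python
import cmath

def autocorr_ratio(hist, cutoff=4):
    """Energy at or above the cut-off frequency over the total energy (equation (7))."""
    m = len(hist)
    # Equation (8), inner sum: autocorrelation of the histogram at every lag l,
    # with wrap-around indexing (an assumption of this sketch).
    r = [sum(hist[n] * hist[(n - l) % m] for n in range(m)) for l in range(m)]
    # Discrete Fourier transform of the autocorrelation, one-sided.
    power = []
    for f in range(m // 2 + 1):
        y_f = sum(r[l] * cmath.exp(-2j * cmath.pi * f * l / m) for l in range(m))
        power.append(abs(y_f) ** 2)
    return sum(power[cutoff:]) / sum(power)
```

A histogram with four equally spaced sharp peaks places most of its spectral energy at multiples of four cycles and therefore gives a ratio close to 1, while a flat histogram leaves only the zero-frequency component and gives a ratio close to 0.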

D. Clustering Based on Musical Notes

As discussed above, a song typically consists of a set of dominant musical notes. Although the melody of the song may be unknown, it is foreseeable that the pitch values, when the song is sung, will be clustered around the dominant musical notes. Therefore, those dominant notes serve as a natural reference for evaluation. The methods disclosed herein may measure clustering behaviour. The methods may achieve this in one or both of two ways.

1) k-Means Clustering: Tightly grouped clusters of pitch values across the histogram indicate that most of the pitch values are close to the cluster centres. This in turn means that the same notes are hit consistently. Keeping this idea in mind, the method may involve applying k-Means clustering to the pitch values. In the present embodiments, k=12 for the 12 semitones in an octave.

Whether the pitch values are tightly or loosely clustered can be represented by the average distance of each pitch value to its corresponding cluster centroid. This distance is inversely related to the singing quality, i.e. the smaller the distance, the better the singing quality. This singing quality may be assessed by determining an average distance of one or more pitch values of the first input to its corresponding cluster centroid. The average cluster distance may be defined as:

$\begin{matrix}{{kMeans} = {\frac{1}{L}{\sum_{i = 1}^{k}d_{i}^{2}}}} & (9)\end{matrix}$

where L is the total number of frames with valid pitch values, and d_(i) is the total distance of the pitch values from the centroid in the i^(th) cluster. This may be defined as:

$\begin{matrix}{d_{i}^{2} = {\sum_{j = 1}^{L_{i}}\left( {p_{ij} - c_{i}} \right)^{2}}} & (10)\end{matrix}$

where p_(ij) is the j^(th) pitch value in the i^(th) cluster, c_(i) is the i^(th) cluster centroid obtained from the k-Means algorithm, L_(i) is the number of pitch values in the i^(th) cluster, and i ranges over 1, 2, . . . , k, where k is the number of clusters.

The difference between this measure and the PeakBW measure is that PeakBW is a function of the number of dominant peaks, whereas in kMeans, the number of clusters is fixed to 12, corresponding to all the possible semitones in an octave. Thus, they are different in capturing the influence of the dominant notes on the evaluation measure.

2) Binning: Another way to measure the clustering of the pitch values is by simply dividing the 1200 cents (or 120 pitch bins) into 12 equi-spaced semitone bins, and computing the average distance of each pitch value to its corresponding bin centroid. Equations (9) and (10) hold true for this method too; the only difference is that the cluster boundaries are fixed in binning methods at 100 cents.
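Because equations (9) and (10) apply to both clustering approaches, the spread computation can be written once and fed with cluster labels from either method. The sketch below shows the Binning variant. It assumes that the twelve 100-cent bins start at −600 cents and that each bin centroid is the mean of the pitch values assigned to that bin. The function names are hypothetical.

```python
def cluster_spread(pitches, labels):
    """Equations (9) and (10): summed squared distance to cluster centroids, divided by L."""
    clusters = {}
    for p, lab in zip(pitches, labels):
        clusters.setdefault(lab, []).append(p)
    total = 0.0
    for members in clusters.values():
        centroid = sum(members) / len(members)                # c_i
        total += sum((p - centroid) ** 2 for p in members)    # d_i squared
    return total / len(pitches)                               # divide by L

def semitone_bin_labels(pitches_cents):
    """Assign each pitch value in [-600, 600] cents to one of 12 fixed 100-cent bins."""
    return [min(int((p + 600.0) // 100.0), 11) for p in pitches_cents]

pitches = [-340.0, -310.0, 110.0, 190.0]
binning = cluster_spread(pitches, semitone_bin_labels(pitches))
```

For the kMeans measure, the same `cluster_spread` function could be applied to labels produced by a k-Means routine with k=12 instead of the fixed bins.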

Therefore, the method may employ one or more of eight musically-motivated absolute measures for evaluating singing quality without a reference: Kurt, Skew, PeakBW, PeakConc₁₁₀, PeakConc₅₀, kMeans, Binning and Autocorr. These are set out in Table I along with the inter-singer relative measures discussed below.

TABLE I list of musically-motivated absolute and inter-singer relative measures
Measure Group | Sub-group based on | Measure names
Musically-motivated absolute measures | Overall pitch distribution | Kurt, Skew
Musically-motivated absolute measures | Pitch concentration | PeakBW, PeakConc₁₁₀, PeakConc₅₀, Autocorr
Musically-motivated absolute measures | Clustering | kMeans, Binning
Inter-singer distance-based relative measures | Pitch | pitch_med_dist, pitch_med_L2, pitch_med_L6_L2, pitchhist12DDistance, pitchhist120DDistance, pitchhisKLD12, pitchhistKLD120
Inter-singer distance-based relative measures | Rhythm | molina_rhythm_mfcc_dist, rhythm_L2, rhythm_L6_L2
Inter-singer distance-based relative measures | Timbre | timbral_dist

Inter-Singer Measures

Present methods evaluate singing quality (e.g. of a first input) without a reference by leveraging the general behaviour of the singing vocals of the same song by a large number of singers (e.g. further inputs). This approach uses inter-singer statistics to rank-order the singers in a self-organizing way.

The problem of discovering good singers from a large pool of singers is similar to that of finding true facts from a large amount of conflicting information provided by various websites. To assist, the method may employ a truth-finder algorithm that utilizes relationships between singing voices and their information. For example, a particular input, or singing vocal, may be considered to be of good quality if it provides many notes or other pieces of information that are common to other ones of the inputs considered by the present methods. The premise behind the truth-finder algorithm is the heuristic that there is only one true pitch at any given location in a song. Similarly, a correct pitch, being tantamount to a true fact identifiable by a truth-finder algorithm, should appear in the same or similar way in various inputs. Conversely, incorrect pitches should be different and dissimilar between inputs, because there are many ways of singing an incorrect pitch. Accordingly, the present methods may employ a truth-finder algorithm to determine correct pitches on the basis that a song can be sung correctly by many people in one consistent way, but incorrectly in many different, dissimilar ways. So, the quality of a perceptual parameter of a singer is proportional to his/her similarity with other singers with respect to that parameter.

The method may therefore involve measuring similarity between singers. To achieve this, a feature may be defined that represents a perceptual parameter of singing quality, for example pitch contour. It is then assumed that all singers are singing the same song, and the feature for a particular input (i.e. of a singer) can be compared with every other input (e.g. every other singer) using a distance metric.

Accordingly, the methods disclosed herein may determine singing quality at least in part by determining how similar the first input is to each further input, wherein greater similarity reflects a higher quality singing voice. A good singer will be similar to the other good singers, and will therefore be close to them, whereas a poor singer will be far from everyone.

FIG. 5 is a radial visualization of the Euclidean distance between the pitch contours of 100 singers, where the centre represents the singer of interest, and the radial distance of each dot represents the distance between the singer of interest and one of the other 99 singers. The angular location of the dots is not part of the similarity metric; the angle is shown for illustration and visualisation purposes. It is evident that the best singers (top-ranked) are similar to other singers, and therefore the dots are clustered around the centre. In contrast, the poorest singer is distant from everybody else. This observation validates the hypothesis that good singers are similar, and poor quality singers are dissimilar. This also points to the viability of a method of ranking singers by their similarity with the peer singers.

In the following sub-sections, metrics are discussed that the present methods may use to measure the inter-singer distance, as summarized in Table I. These metrics measure the distance in terms of perceptual parameters that may include one or more of pitch, rhythm, and timbre. Embodiments of the method may then characterise singers using such distance metrics. It should be understood that assessing the quality of a singer or singing voice, being interchangeably referred to as assessing the quality of an input such as a first input and/or second input, may refer to the relevant assessment being the only assessment, or to the quality of the singer or singing voice being assessed at least in part based on the referred to assessment. In other words, where the disclosure herein refers to assessing singing quality on the basis of a distance metric, that does not preclude the assessment of singing quality also being based on one or more other parameters such as those summarised in Table I.

A. Musically-Motivated Inter-Singer Distance Metrics

Inter-singer similarity may be measured in various ways, such as by examining pitch, rhythm and timbre in the singing.

1) Pitch-Based Relative Distance:

Intonation or pitch accuracy is directly related to the correctness of the pitch produced with respect to a reference singing or baseline melody. Rather than using a baseline melody, the present teachings may apply intonation or pitch accuracy to compare one singer with another. Importantly, it may not be known whether said another singer is a good singer or a poor singer. Therefore, assessing a singer against another singer is not the same assessment as comparing a singing voice to a baseline melody or reference singing.

The distance metrics used are the dynamic time warping (DTW) distance between the two median-subtracted pitch contours (pitch_med_dist), and the Perceptual Evaluation of Speech Quality (PESQ)-based, cognitive modeling theory-inspired pitch disturbance measures pitch_med_L6_L2 and pitch_med_L2.
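A DTW distance between two median-subtracted pitch contours can be sketched as follows. This is a textbook dynamic-programming formulation with an absolute-difference local cost and unconstrained steps; the local cost, step pattern and any normalisation used for pitch_med_dist are not specified above, so these choices and the function names are assumptions of the example.

```python
from statistics import median

def dtw_distance(a, b):
    """Accumulated cost of the best monotonic alignment between sequences a and b."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])           # local cost (assumed)
            cost[i][j] = d + min(cost[i - 1][j],   # a advances, b holds
                                 cost[i][j - 1],   # b advances, a holds
                                 cost[i - 1][j - 1])
    return cost[n][m]

def pitch_med_dist_sketch(contour_a, contour_b):
    """DTW distance after subtracting each contour's own median (values in cents)."""
    med_a, med_b = median(contour_a), median(contour_b)
    return dtw_distance([p - med_a for p in contour_a],
                        [p - med_b for p in contour_b])
```

In this sketch, two renditions that differ only by a constant transposition and by the duration of individual notes have a distance of zero, since the median subtraction removes the offset and the warping absorbs the timing difference.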

Additionally, in this work, pitch histogram-based relative distance metrics are computed. As seen in FIG. 3, there is a clear distinction between the pitch distribution of a good and a poor singer. Embodiments of the present method may compute the distance between the histograms of singers using the Kullback-Leibler (KL) divergence between the normalized pitch histograms. Moreover, as the pitch histogram is computed after subtracting the median of the pitch values, not the actual tuning frequency in which the song is sung, the pitch histograms may be shifted by a few bins across singers. To account for this shift, a DTW-based distance is also computed for the 12-bin and 120-bin histograms between singers as relative measures (pitchhist12KLdist, pitchhist120KLdist, pitchhist12Ddist, pitchhist120Ddist).
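The KL divergence between two normalized pitch histograms can be sketched as below. The additive smoothing constant `eps`, which avoids taking the logarithm of empty bins, is an assumption of this example and is not described above; the function name is hypothetical.

```python
import math

def kl_divergence(p, q, eps=1e-10):
    """KL(p || q) between two histograms with the same number of bins."""
    # Smooth and renormalise so that empty bins do not produce log(0).
    ps = [v + eps for v in p]
    qs = [v + eps for v in q]
    sp, sq = sum(ps), sum(qs)
    return sum((a / sp) * math.log((a / sp) / (b / sq)) for a, b in zip(ps, qs))
```

The divergence is zero when the two histograms coincide and grows as the pitch distribution of one singer departs from that of the other. It is not symmetric, so `kl_divergence(p, q)` and `kl_divergence(q, p)` generally differ.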

2) Rhythm-Based Relative Distance:

Rhythm or tempo is defined as the regular repeated pattern in music that relates to the timing of the notes sung. In karaoke singing, rhythm is determined by the pace of the background music and the lyrics cue on the screen. Therefore, rhythm inconsistencies in karaoke singing typically only occur when the singer is unfamiliar with the melody and/or the lyrics of the song.

Mel-frequency cepstral coefficients (MFCC) capture the short-term power spectrum that represents the shape of the vocal tract and thus the phonemes uttered. So, if the words are uttered at the same pace by two singers, then their rhythm is consistent. Thus, the present method may compute the alignment between two singer utterances. For example, the DTW alignment between two singer utterances with respect to their MFCC vectors may be computed. Presently, the three best performing rhythm measures are used to compute inter-singer rhythm distance. There may be greater or fewer rhythm measures used in the present methods depending on the application and desired accuracy. The three best performing rhythm measures presently are a rhythm deviation measure (termed Molina_rhythm_mfcc_dist) that computes the root mean square error of the linear fit of the optimal path of the DTW matrix computed using MFCC vectors, the PESQ-based rhythm_L6_L2, and rhythm_L2.

3) Timbre-Based Relative Distance:

The method may also, or alternatively, assess singing quality by reference to timbre. Perception of timbre often relates to the voice quality. Timbre is physically represented by the spectral envelope of the sound, which is captured well by MFCC vectors. Presently, the timbral_dist is computed, and refers to the DTW distance between the MFCC vectors of the renditions of two singers.

B. Singer Characterization Using Inter-Singer Distance

The distance between a singer and others, as discussed in relation to the musically-motivated inter-singer distance metrics, is indicative of the singer's singing quality. Present methods may employ one or more of three methods for characterising a singer based on these inter-singer distance metrics. These methods may be referred to as relative scoring methods, which give rise to the relative measures. Relatedly, FIG. 6, referred to below, demonstrates the relative measure computation from the pitch_med_dist distance metric with the three methods for the best and the worst singer out of 100 singers of a song.

1) Method 1: Affinity by Headcount s_(h)(i):

The present methods may determine distance by reference to affinity by headcount. This may involve setting a constant (i.e. predetermined) threshold D_(T) on the distance value across all singer clusters and counting the number of singers within the set threshold as the relative measure or score. If a large number of singers are similar to that singer (i.e. within the constant threshold), then the number of dots within the threshold circle will be high. This is reflected in FIG. 6(a). If dist_(i,j) is the distance between the i^(th) and j^(th) singers, the singer i's relative measure s_(h)(i) by this headcount method is:

$\begin{matrix}{{s_{h}(i)} = \left| \left\{ {{dist}_{i,j} < D_{T}} : {{\forall j} \in Q}, {j \neq i} \right\} \right|} & (11)\end{matrix}$

where Q is the set of singers.

2) Method 2: Affinity by kth Nearest Distance s_(k)(i):

The present methods may determine distance by reference to the k^(th) nearest distance. The number of singers k can be set as the threshold, and consideration is then given to the distance of the k^(th) nearest singer as the relative measure. This is reflected in FIG. 6(b), for k=10. If this distance is small, the singer is likely to be good. Therefore the present method may involve assessing quality of the first input by reference to a predetermined one of the distances from the further inputs, when those distances are arranged in a sequence in order of distance. Singer i's relative measure (s_(k)(i)) according to this Method 2 may be defined as:

$\begin{matrix}{{{s_{k}(i)} = {dist}_{i,{j = k}}};{k \neq i}} & (12)\end{matrix}$

3) Method 3: Affinity by Median Distance s_(m)(i):

The present methods may determine distance by reference to the median distance for all further inputs. The median of the distances of a singer from all other singers can be assigned as the relative measure, which represents his/her overall distance from the rest of the singers (FIG. 6(c)). The median is taken instead of the mean to reduce the influence of outliers. If this distance is small for a singer, the singer is likely to be good. Methods described herein may therefore involve assessing the quality of the first input by reference to the median distance, where a lower median distance is indicative of a higher quality singing voice. The singer i's relative measure by this method is:

$\begin{matrix}{{{s_{m}(i)} = {{median}\left( {dist}_{i,j} \right)}};{{\forall j} \in Q},{j \neq i}} & (13)\end{matrix}$
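The three relative scoring methods of equations (11) to (13) can be sketched from a precomputed matrix of inter-singer distances. The matrix layout (`dist[i][j]` holding the distance between singers i and j) and the function names are assumptions of this example.

```python
from statistics import median

def affinity_by_headcount(dist, i, threshold):
    """Equation (11): number of other singers closer to singer i than the threshold D_T."""
    return sum(1 for j, d in enumerate(dist[i]) if j != i and d < threshold)

def affinity_by_kth_nearest(dist, i, k):
    """Equation (12): distance from singer i to the k-th nearest other singer."""
    others = sorted(d for j, d in enumerate(dist[i]) if j != i)
    return others[k - 1]

def affinity_by_median(dist, i):
    """Equation (13): median distance from singer i to all other singers."""
    return median(d for j, d in enumerate(dist[i]) if j != i)

# Toy example with four singers; singer 3 is far from everyone else.
dist = [
    [0.0, 1.0, 2.0, 9.0],
    [1.0, 0.0, 1.0, 8.0],
    [2.0, 1.0, 0.0, 7.0],
    [9.0, 8.0, 7.0, 0.0],
]
```

Note the opposite polarity of the measures: a larger headcount indicates a better singer, whereas smaller k-th nearest and median distances indicate a better singer.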

Ranking Strategy, and Fusion Methods

Being able to determine how good a particular singer is, is desirable. This can be achieved using the methods and various metrics and measures as set out above. Notably, the same assessment can be extended to a second input (i.e. for a second singer), and any other number of singers. In this regard, the second input may comprise a recording of a singing voice singing the same song as that sung in the first input and any other further inputs. The method may then rank the first input against the second input and determine the first input to be of higher quality than the second input if the similarity between the first input and each further input is greater than a similarity between the second input and each further input. Similarly, the first input may be ranked among all of the inputs, including the further inputs. Each of these rankings can enable a leader board to be established in which singers are ranked against each other.

A. Strategy for Ranking

The primary objective of a leader board is to inform where a singer ranks with respect to the singer's contemporaries. According to best-worst scaling (BWS) theory, humans are able to choose the best and the worst in a small set of choices, which over many such sets results in a rank-ordering of the choices. However, when humans are asked to numerically rate singers on a scale of, say, 1 to 5, the ratings do not reveal discriminatory results. Therefore, it makes sense to study how the absolute and relative measures reflect the ranking, and to design an algorithm towards a better prediction of the overall rank-order of the singers.

Given a set of measure values or scores S={S₁, S₂, . . . , S_(T)}, where S_(i) represents a score of the i^(th) singer, and T is the total number of singers of a song, the singers can be rank-ordered as:

$\begin{matrix}{{rorder} = \left( {S_{(1)},S_{(2)},\ldots,S_{(T)}} \right)} & (14)\end{matrix}$

where

$\begin{matrix}{S_{(1)} \leq S_{(2)} \leq \ldots \leq S_{(T)}} & (15)\end{matrix}$

It is worth noting that all absolute and relative measures are song independent. However, a large number of singers singing the same song are needed to reliably provide the relative measures. Also, every measure is normalised by the number of frames, making it independent of the song duration.

B. Strategies for Score Fusion

Each of the absolute and relative measures can provide a rank-ordering of the singers. To arrive at an overall ranking of the singers, the methods may involve ordering absolute and/or relative measure values for each input in order from largest to smallest. Alternatively, the method may comprise combining or fusing the absolute and/or relative measure values together for a final ranking.

Where multiple measures are used, the method may involve computing an overall ranking by computing an average of the ranks (AR) of all the measures for each singer. This method of score fusion does not need any statistical model training, but gives equal importance to all the measures. Considering that some measures are more effective than others, the method may instead employ a linear regression (LR) model that gives different weights to the measures. Owing to the success of neural networks and the possibility of a non-linear relation between the measures and the overall rank, the method may instead employ a neural network model to predict the overall ranking from the absolute and the relative measures. For experimental purposes, a number of neural network models were considered. One of the neural network models (NN-1) consists of no hidden layers, but a non-linear sigmoid activation function. The other neural network model (NN-2) consists of one hidden layer with 5 nodes, with sigmoid activation functions for both the input and the hidden layers. The models are summarized in Table II.

TABLE II summary of the fusion models
# | Model | Description | Equation
1 | AR | Equally weighted sum of individual measure ranks | $y = {\frac{1}{N}{\sum\limits_{i = 1}^{N}r_{i}}}$
2 | LR | Weighted sum of measures | $y = b + w^{T}x$
3 | NN-1 | MLP with sigmoid activation, no hidden layer | $y = S\left( b + w^{T}x \right)$
4 | NN-2 | MLP with sigmoid activation, one hidden layer with five nodes | $y = S\left( b^{(2)} + {w^{(2)}}^{T}S\left( b^{(1)} + {w^{(1)}}^{T}x \right) \right)$

In Table II, r_(i) is the rank-ordering of singers according to the i^(th) measure, N is the number of measures, x is a measure vector, w^((i)) is a weight vector of the i^(th) layer, b is a bias, S(.) is the sigmoid activation function, R(.) is the ReLU activation function, y is the predicted score, AR is the average rank and LR is the linear regression.

The performance of the fusion of the two scoring systems, i.e. fusion of the 8 absolute measures system and the 11 relative measures system, was also investigated. The methods taught herein may combine them in any appropriate manner. One method to combine them is early-fusion, where all the scores from the evaluation measures are incorporated to get a 19-dimensional score vector for each snippet of each input. Another method of combining the measures is late-fusion, where the average of the ranks predicted independently from the absolute and the relative scoring systems is computed.
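The average-rank (AR) model of Table II and the late-fusion step can be sketched without any model training. In the sketch below, each measure is supplied as a list of per-singer values together with a flag stating whether a higher value is better; that flag, the convention that rank 1 is the best singer, the tie-breaking by input order and the function names are all assumptions of the example.

```python
def ranks(scores, higher_is_better=True):
    """Rank position of each singer (1 = best); ties are broken by input order."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=higher_is_better)
    r = [0] * len(scores)
    for position, idx in enumerate(order, start=1):
        r[idx] = position
    return r

def average_rank(measures, higher_is_better):
    """AR fusion: y = (1/N) * sum of the per-measure ranks r_i, for every singer."""
    per_measure = [ranks(m, hib) for m, hib in zip(measures, higher_is_better)]
    n = len(per_measure)
    return [sum(r[s] for r in per_measure) / n for s in range(len(measures[0]))]

def late_fusion(ranks_absolute, ranks_relative):
    """Average of the ranks predicted independently by the two scoring systems."""
    return [(a + r) / 2 for a, r in zip(ranks_absolute, ranks_relative)]

# Three singers, one absolute measure (Kurt, higher is better) and one
# relative measure (median distance, lower is better).
kurt_values = [5.0, 2.0, 3.0]
median_dist = [20.0, 30.0, 10.0]
fused = average_rank([kurt_values, median_dist], [True, False])
```

Early-fusion, by contrast, would concatenate all 19 measure values into one vector per snippet before a single model is applied.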

Data Preparation

To evaluate singing quality without a reference, experiments were conducted using the musically-motivated absolute measures, the inter-singer distance based relative measures, and the combinations of these measures. Discussed below are the singing voice dataset and the subjective ground-truths used for these experiments.

A. Singing Voice Dataset

The dataset used for experiments consisted of four popular Western songs, each sung by 100 unique singers (50 male, 50 female), extracted from Smule's DAMP dataset. For the purpose of analysis, it is assumed that all singers are singing the same song. The DAMP dataset consists of 35,000 solo-singing recordings without any background accompaniments. The selected subset comprises the four most popular songs in the DAMP dataset with more than 100 unique singers singing them. Songs were also selected with an equal or roughly equal number of male and female singers to avoid gender bias. All the songs are rich in steady notes and rhythm, as summarised in Table III. The dataset consists of a mix of songs with long and sustained as well as short duration notes, with a range of different tempi in terms of beats per minute (bpm).

TABLE III summary of the singing voice dataset. Notes can be of short, long or mixed durations
# | Song Name | Nature of Melody: Pitch Range | Nature of Melody: Note duration | Tempo (bpm)
1 | Let it go (Frozen) | More than an octave | Mix | 68
2 | Cups (Pitch Perfect) | Within an octave | Short | 130
3 | When I was your man (Bruno Mars) | More than an octave | Mix | 73
4 | Stay (Rihanna) | Within an octave | Mix | 112

The methods disclosed herein may employ an autocorrelation-based pitch estimator to produce pitch estimates. For example, the pitch estimates may be determined from the autocorrelation-based pitch estimator PRAAT. PRAAT gives the best voicing boundaries for singing voice with the least number of post-processing steps or adaptations, when compared to other pitch estimators such as the source-filter model based STRAIGHT and the modified autocorrelation-based YIN. The method may also apply a generic post-processing step to remove frames with low periodicity.

B. Subjective Ground-Truth

To validate the objective measures for singing evaluation, subjective ratings are required as ground-truth. Consistent ratings can be obtained from professionally trained music experts. However, obtaining such ratings at a large scale may not always be possible, as it can be time consuming and expensive. Crowd sourcing platforms, such as Amazon Mechanical Turk (MTurk), are effective for obtaining reliable human judgments of singing vocals. Ratings provided by MTurk users demonstrably correlate well with ratings obtained from professional musicians in a lab-controlled experiment. The Pearson's correlations between lab-controlled music-expert ratings and filtered MTurk ratings for various parameters are as follows: overall singing quality: 0.91, pitch: 0.93, rhythm: 0.93, and voice quality: 0.65. Given the high correlation, MTurk was used to derive the subjective ground-truth for present experiments.

While it is possible that professional musicians rate singing quality at an absolute scale of 5 consistently, the ratings through crowd sourcing are less certain. Also, absolute ratings are known to not discriminate between items, and each rating on the scale is not precisely defined. Therefore, the present methods used in experimental assessments employed a relative rating called best-worst scaling (BWS) which can handle a long list of options and always generates discriminating results as the respondents are asked to choose the best and worst option in a choice set. At the end of this exercise, the items can be rank-ordered according to the aggregate BWS scores of each item, given by:

$\begin{matrix}{B = \frac{n_{best} - n_{worst}}{n}} & (16)\end{matrix}$

where n_(best) and n_(worst) are the number of times the item is marked as best and worst respectively, and n is the total number of times the item appears.
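By way of illustration only, Equation (16) may be computed as in the following minimal sketch. The function name `bws_scores` and the trial format (items shown, item marked best, item marked worst) are assumptions made for this example and are not taken from the experiments described herein.

```python
from collections import Counter

def bws_scores(trials):
    """Aggregate BWS score per item: (n_best - n_worst) / n, as in Equation (16).

    trials: iterable of (items_shown, best_item, worst_item) tuples
            (a hypothetical format chosen for this sketch).
    """
    n_best, n_worst, n_seen = Counter(), Counter(), Counter()
    for items, best, worst in trials:
        n_seen.update(items)   # n: number of times each item appears
        n_best[best] += 1      # n_best: times the item is marked best
        n_worst[worst] += 1    # n_worst: times the item is marked worst
    return {item: (n_best[item] - n_worst[item]) / n_seen[item]
            for item in n_seen}

# Three pairwise trials among hypothetical singers A, B and C.
trials = [
    (("A", "B"), "A", "B"),
    (("A", "C"), "A", "C"),
    (("B", "C"), "B", "C"),
]
scores = bws_scores(trials)
```

In this toy example singer A is always chosen as best and singer C always as worst, so the scores span the full range of the scale from 1 to -1.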

The Spearman's rank correlation between the MTurk experiment and the lab-controlled experiment was 0.859.

A pairwise BWS test was also conducted on MTurk where a listener was asked to choose the better singer among a pair of singers singing the same song. One excerpt of approximately 20 seconds from every singer of a song (the same 20 seconds for all the singers of a song) was presented. There are ¹⁰⁰C₂ ways to choose 2 singers from 100 singers of a song, i.e. 4,950 Human Intelligence Tasks (HITs) per song. This experiment was conducted separately for each of the 4 songs of Table III. Therefore there were in total 4,950×4=19,800 HITs.
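The HIT counts stated above follow directly from the number of unordered singer pairs, and may be verified with the short calculation below. The variable names are illustrative only.

```python
import math

singers_per_song = 100
songs = 4

# Number of unordered pairs of singers for one song (100 choose 2).
hits_per_song = math.comb(singers_per_song, 2)

# One pairwise comparison task per pair, repeated for each song.
total_hits = hits_per_song * songs
```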

Filters were applied to the MTurk users. The users were asked for their experience in music and to annotate musical notes as a test. Their attempt was accepted only if they had some formal training in music, and could write the musical notations successfully. A filter was also applied on the time spent in performing the task to remove the less serious attempts where the MTurk users may not have spent time listening to the snippets.

Experiments

In the sections entitled Inter-singer measures and Musically-motivated measures, various musically-motivated absolute and relative objective measures were designed. It is expected that these measures can assess the inherent properties of singing quality that are independent of a reference. When the absolute and relative measures are appropriately combined, a leaderboard of singers can be generated, ranked in the order of their singing ability. FIG. 7 shows the overview of this framework 700, in which Singer A (the singer in question) provides a first input 702. The first input 702 is a recording of the singing voice of Singer A. One or more further inputs 704 are received, which in the present embodiment include a recording by Singer A but in other embodiments may not. A pitch histogram is developed for Singer A (at 706), from which absolute measures are determined (at absolute scoring system 708). Notably, the absolute measures do not reference the one or more further inputs 704. Various features, such as MFCC, pitch contour and/or pitch histogram, are calculated for the first input 702 (at 710) and for the one or more further inputs 704 (at 712). These features are inputted into a relative scoring system 714 that scores the first input 702 relative to the one or more further inputs 704. The scores produced by the absolute scoring system 708 and the relative scoring system 714 are fused at system fusion module 716. The system fusion module 716 determines the quality of the singing voice for the singer in question. The same process can be undertaken for additional singing voices, all of which can then be ranked on leaderboard 710. In the present case, the analysis of the voice of Singer A may include using all of the one or more further inputs 704 except the input provided by Singer A. The same analysis can then be conducted for each individual input of the one or more further inputs 704, in a leave-one-out data set, i.e. input 702 may be taken from the one or more inputs 704, and relative measures for input 702 can then be determined with reference to each remaining input of the one or more inputs 704.
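The leave-one-out arrangement described above may be sketched as follows. This is a simplified illustration only: the generator name `leave_one_out` is hypothetical, and the recordings are represented by placeholder labels rather than audio or extracted features.

```python
def leave_one_out(recordings):
    """Yield (first_input, further_inputs) pairs, one per recording.

    Each recording in turn plays the role of the first input (702), and the
    remaining recordings play the role of the further inputs (704).
    """
    for i, first in enumerate(recordings):
        yield first, recordings[:i] + recordings[i + 1:]

# Placeholder labels standing in for recordings of the same song.
pool = ["singer_A", "singer_B", "singer_C"]
splits = list(leave_one_out(pool))
```

Each pair produced in this way could then be passed to a relative scoring step, so that every singer in the pool is scored against all of the others.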

Various methods to combine the absolute and relative measures were explored, as discussed under the heading “Ranking strategy and fusion models—B. Strategy for score fusion”. The rank-orders of the individual measures are averaged to obtain an average rank (AR). The linear regression (LR) model was trained, and the two different neural network models (NN-1, NN-2) were employed, in 10-fold cross-validation. The absolute and relative measure values are the inputs to these networks, while the human BWS scores given in Equation (16) are the output values to be predicted. The loss function for the neural networks is the mean squared error, with the Adam optimiser. It was ensured that, in every fold, an equal number of singers are present from every song, both in training and test data. All computations are done using scikit-learn.
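The average rank (AR) fusion mentioned above may be illustrated by the following minimal sketch. It assumes that a higher measure value indicates better singing and that there are no ties. The helper names `ranks` and `average_rank` and the toy values are illustrative only.

```python
def ranks(values, higher_is_better=True):
    """Return the rank (1 = best) of each entry in values; ties are not handled."""
    order = sorted(range(len(values)), key=lambda i: values[i],
                   reverse=higher_is_better)
    result = [0] * len(values)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result

def average_rank(measure_table):
    """Average the per-measure ranks of each singer.

    measure_table: dict mapping a measure name to a list of values,
                   one value per singer (hypothetical layout).
    """
    per_measure = [ranks(values) for values in measure_table.values()]
    n_singers = len(per_measure[0])
    return [sum(r[i] for r in per_measure) / len(per_measure)
            for i in range(n_singers)]

# Two hypothetical measures evaluated for three singers.
measures = {"measure_1": [0.9, 0.5, 0.1], "measure_2": [0.8, 0.2, 0.4]}
ar = average_rank(measures)
```

A lower average rank indicates a better overall placing. In this toy case the first singer is ranked first by both measures, while the other two singers tie on average rank.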

To validate the present hypothesis, several experiments were conducted. The roles of the absolute and the relative measures in predicting the overall human judgment were investigated individually, as were the methods of combining these measures. The influence of the duration of a song excerpt for computational singing quality analysis was also observed. Moreover, the ability of the present machine-based measures was compared with humans in predicting the performance of the underlying perceptual parameters.

In this regard, the baseline system performance from literature, and the achievable upper limit of performance in the form of the human judges' consistency in evaluating singing quality, are useful to understand.

A. Baseline

The global statistics kurtosis and skew were used to measure the consistency of pitch values. These are two of the presently presented eight absolute measures. Moreover, the Interspeech ComParE 2013 (Computational Paralinguistics Challenge) feature set can be used as a baseline. It comprises 60 low-level descriptor contours such as loudness, pitch, MFCCs, and their 1st and 2nd order derivatives, in total 6,373 acoustic features per audio segment or snippet. This same set of features was extracted using the OpenSmile toolbox to create the present baseline for comparison. A 10-fold cross-validation experiment was conducted using snippet 1 from all the songs to train a linear regression model with these features. The Spearman's rank correlation between the human BWS rank and the output of this model is 0.39. This rank correlation value is an assessment of how well the relationship between the two variables can be described using a monotonic function. This implies that, with the set of known features, the baseline machine-predicted singing quality ranks have a positive but low correlation with those given by humans.
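For illustration, Spearman's rank correlation between two rankings without ties may be computed from the rank differences, as in the sketch below. The function name `spearman_rho` and the example rankings are assumptions made for this example; they are not the experimental data.

```python
def spearman_rho(rank_a, rank_b):
    """Spearman's rank correlation for two tie-free rankings of equal length.

    Uses rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)), where d is the difference
    between the two ranks of each item.
    """
    n = len(rank_a)
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d_squared / (n * (n * n - 1))

# Hypothetical human and machine rankings of five singers.
human_rank = [1, 2, 3, 4, 5]
machine_rank = [2, 1, 3, 5, 4]
rho = spearman_rho(human_rank, machine_rank)
```

A value of 1 indicates identical rankings and a value of -1 indicates exactly reversed rankings. The two neighbouring swaps in the toy example give a value of 0.8.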

B. Performance of Human Judges

In a pilot study, 5 professional musicians were recruited to provide singing quality ratings for 10 singers singing a song. These musicians were trained in vocal and/or musical instruments in different genres of music such as jazz, contemporary, and Chinese orchestra, and all of them were stage performers and/or music teachers. The subjective ratings obtained from them showed high inter-judge correlation of 0.82. This shows that humans do not always agree with each other, and there is, in general, an upper limit of the achievable performance of any machine-based singing quality evaluation. Thus, the goal of the present singing evaluation algorithm is to achieve this upper limit of correlation with human judges.

C. Experiment 1: Comparison of Singer Characterization Methods Using Inter-Singer Distance

In this experiment, a preliminary investigation was performed to compare the three singer characterization methods discussed under the heading Inter-singer measures—Singer characterisation using inter-singer distance. The relative measures were obtained from these methods for each of the 11 inter-singer distance measures. FIG. 8 shows the Spearman's rank correlation of the human BWS ranks with ranks from these relative measures used with the six models of Table II, over snippet 1 of all the 4 songs for the three methods. To observe the best case scenario for method 1, its distance threshold is optimized for each measure for snippet 1. The number-of-singers threshold for method 2 is empirically set as 10 singers, assuming that roughly at least ten percent of singers in a large pool of singers would be good. In this way, if the distance of a particular singer from the 10^(th) nearest singer is small, it means that the singer sings very similarly to 10 singers, thus the singer is good.
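The k^(th) nearest distance idea of method 2 may be sketched as follows, assuming that a symmetric matrix of inter-singer distances has already been computed for one distance measure. The function name `kth_nearest_distance` and the toy distance matrix are hypothetical. A small pool with k=2 is used here in place of the 100 singers and k=10 of the experiments.

```python
def kth_nearest_distance(dist, k):
    """For each singer, return the distance to their k-th nearest other singer.

    dist: square matrix (list of lists) of inter-singer distances, with
          dist[i][j] the distance between singer i and singer j.
    A smaller value suggests the singer is close to at least k other singers.
    """
    scores = []
    for i, row in enumerate(dist):
        others = sorted(d for j, d in enumerate(row) if j != i)
        scores.append(others[k - 1])
    return scores

# Toy distances: singers 0 to 2 sing similarly, singer 3 is an outlier.
dist = [
    [0.0, 1.0, 2.0, 9.0],
    [1.0, 0.0, 1.5, 8.0],
    [2.0, 1.5, 0.0, 7.0],
    [9.0, 8.0, 7.0, 0.0],
]
scores = kth_nearest_distance(dist, k=2)
```

In this toy pool the outlying singer receives a much larger k^(th) nearest distance than the three mutually similar singers, and would therefore be ranked last under this measure.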

It was observed that method 2 (k^(th) nearest distance method) performs better than the other two methods for all the six models. The result suggests that our assumption that at least ten percent in a pool of singers would be good serves our purpose. Method 3, i.e. the median of the distances of a particular singer from the rest of the singers, assumes that half of the pool of singers would be good singers, which is not a reliable assumption; therefore this method performs the worst.

With the preliminary findings, it was determined that the relative measures should be computed using method 2 in the rest of the experiments. Thus, while the present methods may employ any one of methods 1 to 3 in assessing inter-singer distance measures, a preferred embodiment employs method 2.

D. Experiment 2: Evaluating the Measures Individually

An analysis was then performed as to how well each of the absolute and relative measures can individually predict the ranks of the singers. FIG. 9 shows the Spearman's rank correlation of each of the 8 absolute and the 11 relative score vectors with the human BWS ranks. It is clear that all the derived measures show a positive correlation with humans, although some correlate better than others. The Autocorr measure shows the best correlation among the absolute measures. This suggests that the interval pattern of the dominant notes in the histogram carries important information about singing quality. Thus, in a preferred embodiment, the method assesses singing quality of the first input (and other inputs as necessary) using the interval pattern of dominant notes as an input. The PeakConc₅₀ shows better performance than PeakConc₁₁₀, which agrees with findings in literature that the human ear is sensitive to changes in pitch as small as 25 cents.

The relative measures, in general, perform better than the absolute measures, which means that the inter-singer comparison method is closer to how humans evaluate singers. The pitch-based relative measures perform better than the rhythm-based relative measures. This is an expected behaviour for karaoke performances, where the background music and the lyrical cues help the singers to maintain their timing. Therefore, the rhythm-based measures do not contribute as much in rating the singing quality. Among the relative measures, pitchhist120DDistance performs the best, along with the KL-divergence measures, showing that inter-singer pitch histogram similarity is a good indicator of singing quality. The pitch_med_dist measure follows closely, indicating that the comparison of the actual sequence of pitch values and the duration of each note gives valuable information for assessing singing quality. These aspects are not captured by pitch histogram-based methods.

Another interesting observation is the high correlation of the timbral_dist measure. It indicates that voice quality, represented by the timbral distance, is an important parameter when humans compare singers to assess singing quality. This observation supports the timbre-related perceptual evaluation criteria of human judgment such as timbre brightness, colour/warmth, vocal clarity, strain. The timbral distance measure captures the overall spectral characteristics, thus represents the timbre-related perceptual criteria.

E. Experiment 3: Absolute Scoring System: The Fusion of Absolute Measures

In this experiment, the performance of the combination of musically-motivated pitch histogram-based absolute measures, introduced in the section entitled Musically-motivated measures, in ranking the singers was evaluated. Table IV shows the Spearman's rank correlation between the human BWS ranks and the ranks predicted by absolute measures with different fusion models. Four different snippets were evaluated from each song and the ranks were averaged over multiple snippets. The last column in Table IV shows the performance of the absolute measures extracted from the full song (more than 2 minutes' duration) (AbsFull) combined with the individual snippet ranks.

TABLE IV Evaluation of absolute measures. The values in the table are Spearman's rank correlation between the human BWS ranks and the machine generated ranks (all P-values < 0.05).

Model #   Snippet 1   Snippet 1+2   Snippet 1+2+3   Snippet 1+2+3+4   Snippet 1+2+3+4 + AbsFull
1         0.3556      0.4134        0.4702          0.4796            0.4796
2         0.3695      0.3879        0.4143          0.4205            0.4558
3         0.3329      0.3567        0.3917          0.3975            0.4331
4         0.3073      0.3372        0.3866          0.3838            0.4228
5         0.3924      0.4589        0.4781          0.4711            0.4942
6         0.386       0.4475        0.465           0.4603            0.4887

1) Effect of duration: The pitch histogram for the full song is expected to show a better representation than the histogram of a snippet of the song because more data results in better statistics. As seen in Table IV, with an increase in the number of snippets, i.e. increase in the duration of the song being evaluated, the predictions improve, with the one with the full song performing the best. This indicates that more data (~80 seconds) provides better statistics and, therefore, better predictions, while humans can judge reliably by a shorter duration clip of ~20 seconds.

2) Effect of the score fusion models: As some absolute measures are more effective than others, the weighted combination with non-linear activation functions (Models 5 and 6) shows a better performance than the equally weighted average of ranks (Model 1). One hidden layer in the neural network model (NN-2) performs better than the one without a hidden layer (NN-1), as well as the LR model. This indicates that non-linear combination of the measures provides a better prediction of human judgement. Interestingly, the average of ranks (Model 1) performs comparably with NN-2, suggesting that all measures are informative in making a meaningful ranking. It also indicates that although the measures individually may not have performed equally well (FIG. 9), each of them captures a different aspect of the pitch histogram quality; therefore, combining them with equal weights results in a comparable performance.

It is important to note that there are specific conditions when the absolute measures fail to perform. By converting a pitch contour into a histogram, information about timing or rhythm is lost. The correctness of the note order also cannot be evaluated through the pitch histogram. Moreover, the relative positions of the peaks in the histogram cannot be modelled without a reference, i.e. incorrect location of peaks goes undetected. For example, if a song consists of five notes, and a singer sings five notes precisely but they are not the same notes as those present in the song, then the absolute measures would not be able to detect the erroneous singing. The pitch histogram also loses information about localized errors, i.e. errors occurring for a short duration. According to cognitive psychology and PESnQ measures, localized errors have greater subjective impact than distributed errors. Therefore, if a singer sings incorrectly for a short duration, and then corrects himself/herself, the absolute measures are unable to capture it.
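The loss of note order and of absolute peak position described above may be demonstrated with a toy example. The note sequences below are hypothetical and are written as MIDI note numbers, which is an assumption of this sketch rather than the pitch representation of the described embodiments.

```python
from collections import Counter

# Hypothetical note sequences, one value per frame (MIDI note numbers assumed).
correct = [60, 60, 62, 64, 64, 64]
wrong_order = [64, 60, 64, 62, 60, 64]       # same notes, wrong order
wrong_notes = [note + 1 for note in correct]  # precise, but a semitone off

# A histogram only counts how often each pitch occurs.
hist_correct = Counter(correct)
hist_wrong_order = Counter(wrong_order)
hist_wrong_notes = Counter(wrong_notes)

# The wrong note order is invisible to the histogram.
order_error_detected = hist_correct != hist_wrong_order

# The shifted rendition has peaks of identical height (identical shape),
# so reference-free measures of peak sharpness cannot tell it apart.
same_peak_shape = (sorted(hist_correct.values())
                   == sorted(hist_wrong_notes.values()))
```

Both erroneous renditions therefore yield histograms that are indistinguishable in shape from that of the correct rendition, which is why relative measures are used alongside the absolute measures.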

F. Experiment 4: Relative Scoring System: Evaluating the Fusion of Relative Measures

In this experiment, the performance of the combination of the inter-singer relative measures computed from method 2, discussed under the heading “EXPERIMENTS—Experiment 1: Comparison of Singer Characterization Methods using Inter-Singer Distance”, was investigated. Table V, third column, shows the Spearman's rank correlation between the human BWS ranks and the ranks predicted by the relative measures with the different fusion models. Four snippets were evaluated from each song and ranks were averaged over the snippets. Again, preliminary experiments suggested that samples of longer duration lead to better statistics and, therefore, more accurate scores.

TABLE V Summary of the performance of absolute and relative measures, and their combinations. The values in the table are Spearman's rank correlation between human BWS ranks and the machine generated ranks averaged over four snippets (all P-values < 0.05).

Model #   Absolute measures   Relative measures   Early-fusion   Late-fusion
1         0.4796              0.6396              0.6877         0.7059
2         0.4205              0.5737              0.6413         0.6426
3         0.3975              0.5799              0.6385         0.6407
4         0.3838              0.5688              0.6222         0.6274
5         0.4711              0.6153              0.6636         0.6692
6         0.4603              0.602               0.6623         0.6678

The combinations of the relative measures result in a better performance than the combinations of the absolute measures. This follows from the observation in Experiment 2 (Evaluating the measures individually) that the relative measures individually perform better than the absolute measures. Like the absolute measures, average of ranks (AR) performs better than the other score fusion models, indicating that all relative measures are informative in making meaningful ranking.

G. Experiment 5: System Fusion: Combining Absolute and Relative Scoring Systems

In this experiment, combinations of the 8 absolute and 11 relative measures were investigated by early-fusion and late-fusion methods (see B. Performance of human judges). The rank correlation between the BWS ranks and the ranks obtained from the early-fusion method averaged over four snippets is reported in column 4, Table V, and that from late-fusion is in column 5.

The results suggest that the late-fusion of the systems shows a better correlation with humans than early-fusion. This means that predictions given separately from the absolute and the relative measures provide different and equally important information. Therefore, equal weighting to both shows better correlation with humans. Moreover, a simple rank average shows a better performance than the complex neural network models. This shows that the individual measures, although showing different levels of correlation with humans, individually capture different information about singing quality. It is important to note that the process of converting values to ranks is inherently non-linear.
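One way to see why the two fusion strategies can differ under rank averaging is sketched below. The sketch assumes that early-fusion pools all 19 per-measure ranks with one vote per measure, whereas late-fusion first averages within each scoring system and then gives the two systems equal weight. This interpretation, the function names and the rank values are assumptions made for illustration and may not match the fusion models actually used.

```python
def mean(values):
    return sum(values) / len(values)

def early_fusion_rank(abs_ranks, rel_ranks):
    """Pool every per-measure rank: each measure gets one vote (assumed)."""
    return mean(abs_ranks + rel_ranks)

def late_fusion_rank(abs_ranks, rel_ranks):
    """Average within each system, then weight the two systems equally (assumed)."""
    return mean([mean(abs_ranks), mean(rel_ranks)])

# Hypothetical ranks of one singer under 8 absolute and 11 relative measures.
abs_ranks = [10] * 8
rel_ranks = [30] * 11

early = early_fusion_rank(abs_ranks, rel_ranks)
late = late_fusion_rank(abs_ranks, rel_ranks)
```

Under these assumptions the early-fusion rank is pulled towards the relative measures simply because there are more of them, whereas the late-fusion rank sits midway between the two systems.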

H. Experiment 6: Humans Versus Machines

An important advantage of objective methods for singing evaluation is that each underlying perceptual parameter is objectively evaluated independently of the other parameters, i.e. the computed measures are uncorrelated amongst each other. On the other hand, the individual parameter scores from humans tend to be biased by their overall judgment of the rendition. For example, a singer who is bad in pitch may or may not be bad in rhythm. However, humans tend to rate their rhythm poorly due to bias towards their overall judgment.

In this experiment, data was used where music experts were asked to rate each singer on a scale of 1 to 5 with respect to the three perceptual parameters pitch, rhythm, and timbre individually. FIG. 10(a) shows that human ratings for the three perceptual parameters are highly correlated amongst each other. On the same data, machine scores for the three parameters show significantly less correlation (FIG. 10(b)). This observation was also verified on the data used for the experiments in this work (FIG. 10(c)). Therefore, machine scores are better than humans in giving unbiased objective feedback to a singer on the underlying perceptual details of their rendition. This feedback can be useful to a learner for understanding how they can improve upon the individual parameters.

I. Discussion

The experimental results show that the derived absolute and relative measures are reliable reference-independent indicators of singing quality. With both absolute and relative measures, the proposed framework effectively addresses the issue with pitch interval accuracy by looking at both the pitch offset values as well as other aspects of the melody. The absolute measures such as ρ_(c), ρ_(b) and α characterise the pitch histogram of a given song. Furthermore, the relative measures compare a singer with a group of other singers singing the same song. It is unlikely for all singers in a large dataset to sing one note throughout the song.

The present experiments show that 100 renditions from different singers constitute a database for reliable automatic leaderboard ranking. The absolute measures in the framework are independent of the singing corpus size, but the relative measures are scalable to a larger corpus.

The proposed strategy of evaluation is applicable for large-scale screening of singers, such as in singing idol competitions and karaoke apps. In this work emphasis was given to the common patterns in singing. This work explores Western pop, to endeavour to provide a large-scale reference-independent singing evaluation framework.

Conclusions and Future Work

In this work, a method for assessing singing quality was introduced, as was a self-organizing method for producing a leaderboard of singers relative to their singing quality without relying on a reference singing sample or musical score, by leveraging on musically-motivated absolute measures and veracity based inter-singer relative measures. The baseline method (A. Baseline) shows a correlation of 0.39 with human assessment using linear regression, while the linear regression model with the presently proposed measures shows a correlation of 0.64, and the best performing method shows a correlation of 0.71, which is an improvement of 82.1% over the baseline. This improvement is attributed to:

-   the musically-motivated absolute measures, that quantify various singing quality discerning properties of the pitch histogram, and
-   the veracity based musically-informed relative measures that leverage on inter-singer statistics and overcome the drawbacks of using only absolute measures.
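The relative improvement over the baseline quoted above may be checked from the stated correlation values with the short calculation below. The variable and function names are illustrative only.

```python
baseline = 0.39      # baseline feature set with linear regression
proposed_lr = 0.64   # proposed measures with linear regression
best = 0.71          # best performing method

def relative_improvement_pct(new, old):
    """Relative improvement of new over old, as a percentage."""
    return (new - old) / old * 100

best_improvement = round(relative_improvement_pct(best, baseline), 1)
lr_improvement = round(relative_improvement_pct(proposed_lr, baseline), 1)
```

The first value reproduces the 82.1% figure. The second value, for the linear regression model with the proposed measures, is computed here only for comparison and is not stated in the text.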

It was found that the two kinds of measures provide distinct information about singing quality, therefore a combination of them boosts the performance.

It was also found that the proposed ranking technique provides objective measures for perceptual parameters, such as pitch, rhythm, and timbre, independently of one another, which human subjective assessment fails to produce.

It will be appreciated that many further modifications and permutations of various aspects of the described embodiments are possible. Accordingly, the described aspects are intended to embrace all such alterations, modifications, and variations that fall within the spirit and scope of the appended claims.

Throughout this specification and the claims which follow, unless the context requires otherwise, the word “comprise”, and variations such as “comprises” and “comprising”, will be understood to imply the inclusion of a stated integer or step or group of integers or steps but not the exclusion of any other integer or step or group of integers or steps.

The reference in this specification to any prior publication (or information derived from it), or to any matter which is known, is not, and should not be taken as an acknowledgment or admission or any form of suggestion that that prior publication (or information derived from it) or known matter forms part of the common general knowledge in the field of endeavour to which this specification relates.

1. A system for assessing quality of a singing voice singing a song, comprising: memory; and at least one processor, wherein the memory stores instructions that, when executed by the at least one processor, cause the at least one processor to: receive a plurality of inputs comprising a first input and one or more further inputs, each input comprising a recording of a singing voice singing the song; determine, for the first input: one or more relative measures of quality of the singing voice by comparing the first input to each further input; and one or more absolute measures of quality of the singing voice; and assess quality of the singing voice of the first input based on the one or more relative measures and the one or more absolute measures.
 2. A system according to claim 1, wherein the at least one processor determines one or more relative measures by assessing a similarity between the first input and each further input.
 3. A system according to claim 2, wherein the at least one processor assesses a similarity between the first input and each further input by, for each relative measure, assessing one or more of a similarity of pitch, rhythm and timbre.
 4. A system according to claim 3, wherein the at least one processor assesses the similarity of pitch, rhythm and timbre as being inversely proportional to a pitch-based relative distance, rhythm-based relative distance and timbre-based relative distance respectively of the singing voice of the first input relative to the singing voice of each further input.
 5. A system according to claim 2, wherein, for a second input comprising a recording of a singing voice singing the song, the at least one processor determines the singing voice of the first input to be higher quality than the singing voice of the second input if the similarity between the first input and each further input is greater than a similarity between the second input and each further input.
 6. (canceled)
 7. A system according to claim 1, wherein each absolute measure of the one or more absolute measures is an assessment of one or more of pitch, rhythm and timbre of the singing voice of the first input.
 8. A system according to claim 7, wherein at least one said absolute measure is an assessment of pitch based on one or more of overall pitch distribution, pitch concentration and clustering on musical notes.
 9. A system according to claim 8, wherein the at least one processor assesses pitch by producing a pitch histogram, and assesses a singing voice as being of higher quality as peaks in the pitch histogram become sharper.
 10. A system according to claim 1, wherein the instructions further cause the at least one processor to rank the quality of the singing voice of the first input against the quality of the singing voice of each further input.
 11. A method for assessing quality of a singing voice singing a song, comprising: receiving a plurality of inputs comprising a first input and one or more further inputs, each input comprising a recording of a singing voice singing the song; determining, for the first input: one or more relative measures of quality of the singing voice by comparing the first input to each further input; and one or more absolute measures of quality of the singing voice; and assessing quality of the singing voice of the first input based on the one or more relative measures and the one or more absolute measures.
 12. A method according to claim 11, wherein determining one or more relative measures comprises assessing a similarity between the first input and each further input.
 13. A method according to claim 12, wherein assessing a similarity between the first input and each further input comprises, for each relative measure, assessing one or more of a similarity of pitch, rhythm and timbre.
 14. A method according to claim 13, wherein the similarity of pitch, rhythm and timbre are assessed as being inversely proportional to a pitch-based relative distance, rhythm-based relative distance and timbre-based relative distance respectively of the singing voice of the first input relative to the singing voice of each further input.
 15. A method according to claim 12, wherein, for a second input comprising a recording of a singing voice singing the song, the singing voice of the first input is determined to be higher quality than the singing voice of the second input if the similarity between the first input and each further input is greater than a similarity between the second input and each further input.
 16. (canceled)
 17. A method according to claim 11, wherein each absolute measure of the one or more absolute measures is an assessment of one or more of pitch, rhythm and timbre of the singing voice of the first input.
 18. A method according to claim 17, wherein at least one said absolute measure is an assessment of pitch based on one or more of overall pitch distribution, pitch concentration and clustering on musical notes.
 19. A method according to claim 18, wherein assessing pitch involves producing a pitch histogram, and wherein a singing voice is assessed as being of higher quality as peaks in the pitch histogram become sharper.
 20. A method according to claim 11, further comprising ranking the quality of the singing voice of the first input against the quality of the singing voice of each further input. 